Road Traffic crash data are useful tools to support the development and implementation of road safety programs that leads to reduced road traffic accidents and in turn saves people’s lives. This report represents my analysis for the Victoria road crashes from July 2014 to June 2019. Through the various data transformation and visualisation tools I will look for data insights and share my observation with Victoria Department of Transport.
This dataset documents crashes on Victorian roads during the five year reporting period from July 2014 to June 2019. This data gives me the opportunity to analyse Victorian fatal and injury crash data based on time, locations, conditions, crash type, road user type etc. For this analysis I will mainly focus on finding answers to the following questions by means of exploration and visualisation.
Accident frequency: How has the accident frequency changed over the years?
Daily traffic trend: Was a particular day of the week more prone to accidents?
Rush hours’ impact on accidents: Was a particular time of the day(morning rush/evening rush) more prone to accidents?
Most impacted regions: How are the accidents distributed across regions? How does it trend over the years?
Alcohol consumption time and light condition: How are alcohol time and overall visibility contributed to the accidents?
Effect of different speed limits: How has the various speed zones affected the severity of the accidents?
Accident type: How does the collision type affect severity of the accidents?
The outcome of the analysis would guide Victoria Department of Transport manage their road safety campaign more efficiently and target investment and resources in the core areas of need and concern.
# importing necessary libraries
library(tidyverse)
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(visdat)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# reading data from csv file
crashes_data <- read_csv("Road_Crashes_for_five_Years_-_Victoria.csv")
## Rows: 77513 Columns: 65── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (29): ACCIDENT_NO, ABS_CODE, ACCIDENT_STATUS, ALCOHOLTIME, ACCIDENT_TYP...
## dbl (34): X, Y, OBJECTID, NODE_ID, LONGITUDE, LATITUDE, VICGRID_X, VICGRID_...
## date (1): ACCIDENT_DATE
## time (1): ACCIDENT_TIME
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# changing variable names to lower cases for better legibility
names(crashes_data) <- tolower(names(crashes_data))
# returning first six rows
head(crashes_data)
## # A tibble: 6 × 65
## x y objectid accident_no abs_code accident_status accident_date
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <date>
## 1 2498217. 2454870. 1 T20140013514 ABS to … Unfinished 2014-07-01
## 2 2502935. 2423005. 2 T20140013549 ABS to … Finished 2014-07-02
## 3 2520154. 2395622. 3 T20140013561 ABS to … Finished 2014-07-02
## 4 2473719. 2402357. 4 T20140013563 ABS to … Finished 2014-07-02
## 5 2484074. 2409124. 5 T20140013567 ABS to … Finished 2014-07-02
## 6 2496691. 2406552. 6 T20140013582 ABS to … Finished 2014-07-02
## # … with 58 more variables: accident_time <time>, alcoholtime <chr>,
## # accident_type <chr>, day_of_week <chr>, dca_code <chr>, hit_run_flag <chr>,
## # light_condition <chr>, police_attend <chr>, road_geometry <chr>,
## # severity <chr>, speed_zone <chr>, run_offroad <chr>, node_id <dbl>,
## # longitude <dbl>, latitude <dbl>, node_type <chr>, lga_name <chr>,
## # region_name <chr>, vicgrid_x <dbl>, vicgrid_y <dbl>, total_persons <dbl>,
## # inj_or_fatal <dbl>, fatality <dbl>, seriousinjury <dbl>, …
# printing structure of data
str(crashes_data)
## spec_tbl_df [77,513 × 65] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ x : num [1:77513] 2498217 2502935 2520154 2473719 2484074 ...
## $ y : num [1:77513] 2454870 2423005 2395622 2402357 2409124 ...
## $ objectid : num [1:77513] 1 2 3 4 5 6 7 8 9 10 ...
## $ accident_no : chr [1:77513] "T20140013514" "T20140013549" "T20140013561" "T20140013563" ...
## $ abs_code : chr [1:77513] "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" "ABS to receive accident" ...
## $ accident_status : chr [1:77513] "Unfinished" "Finished" "Finished" "Finished" ...
## $ accident_date : Date[1:77513], format: "2014-07-01" "2014-07-02" ...
## $ accident_time : 'hms' num [1:77513] 13:09:00 00:01:00 08:30:00 10:08:00 ...
## ..- attr(*, "units")= chr "secs"
## $ alcoholtime : chr [1:77513] "No" "Yes" "No" "No" ...
## $ accident_type : chr [1:77513] "Collision with vehicle" "Collision with vehicle" "collision with some other object" "Collision with vehicle" ...
## $ day_of_week : chr [1:77513] "Monday" "Wednesday" "Wednesday" "Wednesday" ...
## $ dca_code : chr [1:77513] "RIGHT THROUGH" "U TURN" "STRUCK OBJECT ON CARRIAGEWAY" "CROSS TRAFFIC(INTERSECTIONS ONLY)" ...
## $ hit_run_flag : chr [1:77513] "No" "No" "No" "Yes" ...
## $ light_condition : chr [1:77513] "Day" "Dark Street lights on" "Dusk/Dawn" "Unk." ...
## $ police_attend : chr [1:77513] "Yes" "Yes" "Yes" "No" ...
## $ road_geometry : chr [1:77513] "Unknown" "Not at intersection" "Not at intersection" "Cross intersection" ...
## $ severity : chr [1:77513] "Other injury accident" "Other injury accident" "Other injury accident" "Other injury accident" ...
## $ speed_zone : chr [1:77513] "60 km/hr" "60 km/hr" "50 km/hr" "Not known" ...
## $ run_offroad : chr [1:77513] "No" "No" "No" "No" ...
## $ node_id : num [1:77513] 49465 272595 273609 44382 38531 ...
## $ longitude : num [1:77513] 145 -1 NA -1 -1 ...
## $ latitude : num [1:77513] -37.4 -1 NA -1 -1 ...
## $ node_type : chr [1:77513] "Intersection" "Non-Intersection" "Non-Intersection" "Intersection" ...
## $ lga_name : chr [1:77513] "MITCHELL" "WHITTLESEA" "KNOX" "WYNDHAM" ...
## $ region_name : chr [1:77513] "NORTHERN REGION" "METROPOLITAN NORTH WEST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN NORTH WEST REGION" ...
## $ vicgrid_x : num [1:77513] 2498217 2502935 2520154 2473719 2484074 ...
## $ vicgrid_y : num [1:77513] 2454870 2423005 2395622 2402357 2409124 ...
## $ total_persons : num [1:77513] 4 4 1 3 2 1 2 5 4 3 ...
## $ inj_or_fatal : num [1:77513] 1 3 1 1 1 1 1 2 1 1 ...
## $ fatality : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ seriousinjury : num [1:77513] 0 0 0 0 0 1 1 1 0 0 ...
## $ otherinjury : num [1:77513] 1 3 1 1 1 0 0 1 1 1 ...
## $ noninjured : num [1:77513] 3 1 0 2 1 0 1 3 3 2 ...
## $ males : num [1:77513] 1 0 0 2 2 1 1 3 2 0 ...
## $ females : num [1:77513] 3 3 1 1 0 0 1 2 2 3 ...
## $ bicyclist : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ passenger : num [1:77513] 2 2 0 1 0 0 0 2 2 1 ...
## $ driver : num [1:77513] 2 2 1 2 2 0 2 3 2 2 ...
## $ pedestrian : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ pillion : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ motorist : num [1:77513] 0 0 0 0 0 1 0 0 0 0 ...
## $ unknown : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ ped_cyclist_5_12 : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ ped_cyclist_13_18: num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ old_pedestrian : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ old_driver : num [1:77513] 0 0 0 0 1 0 0 0 0 0 ...
## $ young_driver : num [1:77513] 0 1 0 0 0 0 0 0 1 0 ...
## $ alcohol_related : chr [1:77513] "No" "No" "No" "No" ...
## $ unlicencsed : num [1:77513] 1 0 0 0 0 0 0 0 0 0 ...
## $ no_of_vehicles : num [1:77513] 2 2 1 2 2 1 2 3 2 2 ...
## $ heavyvehicle : num [1:77513] 0 0 0 0 1 0 0 0 0 0 ...
## $ passengervehicle : num [1:77513] 2 2 1 2 1 0 1 3 2 2 ...
## $ motorcycle : num [1:77513] 0 0 0 0 0 1 0 0 0 0 ...
## $ publicvehicle : num [1:77513] 0 0 0 0 0 0 0 0 0 0 ...
## $ deg_urban_name : chr [1:77513] "TOWNS" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
## $ deg_urban_all : chr [1:77513] "TOWNS" "MELB_URBAN" "MELB_URBAN" "MELB_URBAN" ...
## $ lga_name_all : chr [1:77513] "MITCHELL" "WHITTLESEA" "KNOX" "WYNDHAM" ...
## $ region_name_all : chr [1:77513] "NORTHERN REGION" "METROPOLITAN NORTH WEST REGION" "METROPOLITAN SOUTH EAST REGION" "METROPOLITAN NORTH WEST REGION" ...
## $ srns : chr [1:77513] "B" NA NA "C" ...
## $ srns_all : chr [1:77513] "B" NA NA "C" ...
## $ rma : chr [1:77513] "Arterial Highway" "Arterial Other" "Local Road" "Arterial Other" ...
## $ rma_all : chr [1:77513] "Local Road,Arterial Highway" "Arterial Other" "Local Road" "Arterial Other,Local Road" ...
## $ divided : chr [1:77513] "Undivided" "Undivided" "Undivided" "Divided" ...
## $ divided_all : chr [1:77513] "Undivided" "Undivided" "Undivided" "Undivided,Divided" ...
## $ stat_div_name : chr [1:77513] "Country" "Metro" "Metro" "Metro" ...
## - attr(*, "spec")=
## .. cols(
## .. X = col_double(),
## .. Y = col_double(),
## .. OBJECTID = col_double(),
## .. ACCIDENT_NO = col_character(),
## .. ABS_CODE = col_character(),
## .. ACCIDENT_STATUS = col_character(),
## .. ACCIDENT_DATE = col_date(format = ""),
## .. ACCIDENT_TIME = col_time(format = ""),
## .. ALCOHOLTIME = col_character(),
## .. ACCIDENT_TYPE = col_character(),
## .. DAY_OF_WEEK = col_character(),
## .. DCA_CODE = col_character(),
## .. HIT_RUN_FLAG = col_character(),
## .. LIGHT_CONDITION = col_character(),
## .. POLICE_ATTEND = col_character(),
## .. ROAD_GEOMETRY = col_character(),
## .. SEVERITY = col_character(),
## .. SPEED_ZONE = col_character(),
## .. RUN_OFFROAD = col_character(),
## .. NODE_ID = col_double(),
## .. LONGITUDE = col_double(),
## .. LATITUDE = col_double(),
## .. NODE_TYPE = col_character(),
## .. LGA_NAME = col_character(),
## .. REGION_NAME = col_character(),
## .. VICGRID_X = col_double(),
## .. VICGRID_Y = col_double(),
## .. TOTAL_PERSONS = col_double(),
## .. INJ_OR_FATAL = col_double(),
## .. FATALITY = col_double(),
## .. SERIOUSINJURY = col_double(),
## .. OTHERINJURY = col_double(),
## .. NONINJURED = col_double(),
## .. MALES = col_double(),
## .. FEMALES = col_double(),
## .. BICYCLIST = col_double(),
## .. PASSENGER = col_double(),
## .. DRIVER = col_double(),
## .. PEDESTRIAN = col_double(),
## .. PILLION = col_double(),
## .. MOTORIST = col_double(),
## .. UNKNOWN = col_double(),
## .. PED_CYCLIST_5_12 = col_double(),
## .. PED_CYCLIST_13_18 = col_double(),
## .. OLD_PEDESTRIAN = col_double(),
## .. OLD_DRIVER = col_double(),
## .. YOUNG_DRIVER = col_double(),
## .. ALCOHOL_RELATED = col_character(),
## .. UNLICENCSED = col_double(),
## .. NO_OF_VEHICLES = col_double(),
## .. HEAVYVEHICLE = col_double(),
## .. PASSENGERVEHICLE = col_double(),
## .. MOTORCYCLE = col_double(),
## .. PUBLICVEHICLE = col_double(),
## .. DEG_URBAN_NAME = col_character(),
## .. DEG_URBAN_ALL = col_character(),
## .. LGA_NAME_ALL = col_character(),
## .. REGION_NAME_ALL = col_character(),
## .. SRNS = col_character(),
## .. SRNS_ALL = col_character(),
## .. RMA = col_character(),
## .. RMA_ALL = col_character(),
## .. DIVIDED = col_character(),
## .. DIVIDED_ALL = col_character(),
## .. STAT_DIV_NAME = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
dim(crashes_data)
## [1] 77513 65
The Road Crashes data for Victoria from July 2014 to June 2019 was downloaded from Victoria Department of Transport Open Data Hub.. The data is provided by VicRoads for educational and research purposes. The data was downloaded in csv format and it contains a single file. The dataset has 77513 observations and 65 features. The features include categorical and numerical variables. I get the impression that great deal of effort was made to collect this data and as a result, for most part it looks thorough and accurate. From the high level perspective, I observed that number of road crashes have been steadily reduced in last 3-4 years, vast majority of the accidents though still happen in the rush hours and concentrated in two Metropolitan regions. Due to the time constraints this report will analyse the Victorian fatal and injury crash data based on these key variables - accident_date, accident_time,alcoholtime, accident_type, day_of_week, light_condition, severity, speed_zone, total_persons, longitude, latitude and region_name.
# selecting only the variables necessary for this analysis
sel_crashes_data <- crashes_data %>%
select(c("accident_date", "accident_time", "alcoholtime",
"accident_type", "day_of_week", "light_condition", "severity",
"speed_zone", "total_persons", "longitude", "latitude",
"region_name"))
As part of my research, I traversed through 14 non-identical plots but only present the most relevant ones in this report. To start with I will run summary function to show a set of descriptive statistics for every variable.
# extract summary data of individual objects
summary(sel_crashes_data)
## accident_date accident_time alcoholtime accident_type
## Min. :2014-07-01 Length:77513 Length:77513 Length:77513
## 1st Qu.:2015-08-27 Class1:hms Class :character Class :character
## Median :2016-10-11 Class2:difftime Mode :character Mode :character
## Mean :2016-11-14 Mode :numeric
## 3rd Qu.:2018-02-04
## Max. :2019-06-30
##
## day_of_week light_condition severity speed_zone
## Length:77513 Length:77513 Length:77513 Length:77513
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## total_persons longitude latitude region_name
## Min. : 1.000 Min. : -1.0 Min. :-39.02 Length:77513
## 1st Qu.: 2.000 1st Qu.:144.8 1st Qu.:-37.95 Class :character
## Median : 2.000 Median :145.0 Median :-37.81 Mode :character
## Mean : 2.346 Mean :139.7 Mean :-36.47
## 3rd Qu.: 3.000 3rd Qu.:145.2 3rd Qu.:-37.67
## Max. :89.000 Max. :149.8 Max. : -1.00
## NA's :1 NA's :1106 NA's :1106
# get the number of missing values per column
colSums(is.na(sel_crashes_data))
## accident_date accident_time alcoholtime accident_type day_of_week
## 0 0 0 0 2360
## light_condition severity speed_zone total_persons longitude
## 0 1 0 1 1106
## latitude region_name
## 1106 600
From running above function, we see 2360 values missing in day_of_week, 600 values missing in region_name and just 1 each missing in few other columns. I can handle these missing values by just dropping all the rows with NA’s since they represent very low percentage (2360/77000 = 3.06%) of overall data. For this analysis though I will impute the missing values by filling the columns with mean/mode values to make use of all the data and get better distribution.
# imputing null values with median value for numerical columns and mode value for categorical columns
sel_crashes_data[] <- lapply(sel_crashes_data, function(x) {
if (class(x) == "character") {
xtab <- table(x)
xmode <- names(which(xtab == max(xtab)))
x[is.na(x)] <- xmode
}
else {
x[is.na(x)] <- median(x, na.rm = TRUE)
}
x
})
## Warning in if (class(x) == "character") {: the condition has length > 1 and only
## the first element will be used
# Checking the missing data in each column
vis_miss(sel_crashes_data, 'warn_large_data' = FALSE)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
If we check for missing values using vis_miss function from visdat library, we see there is no missing values exist in the data.
Now I will look for the data points that are really far from the rest of the data points, also called outliers by drawing a boxplot. Removing the outliers will reduce skewness of the data and improve statistical significance.
# checking for outliers
boxplot(sel_crashes_data[,c("total_persons")])
It seems like there is a lot of outliers represented by overlapping points over the box near the ‘0’ line. One particular value in total_persons though really distanced itself from the rest. I will remove this value by filtering the dataset and keep the rest to ensure data integrity.
# removing outliers from the data
sel_crashes_data <- sel_crashes_data %>%
filter(total_persons < 89)
# after removing the outliers
boxplot(sel_crashes_data[,c("total_persons")])
From the above boxplot, it is pretty evident that the extreme value has now been removed.
Now I will do further exploration and visualisation of the data for analysis based on the questions defined in problem description.
How has accident frequency changed over the years?
I will first extract ‘year’ from ‘accident_date’ and then use aggregate functions to summarise ‘number_of_accidents’ in a tibble format. Then create a line plot using geom_line function from ggplot2 library to show the distribution of the accidents over the years. In this instance, line plot will be really useful to show trends in number of accidents from 2014 to 2019. By mapping year to x axis and number of accidents to y axis, we will see yearly trend in the accident number.
# extracting year from accident_date and creating a separate 'year' column
sel_crashes_data <- sel_crashes_data %>%
mutate(year = format(sel_crashes_data$accident_date, format = "%Y"))
# using aggregate functions to summarise the number of accidents
# and assigning it to the variable 'accidents_by_year'
accidents_by_year <- sel_crashes_data %>%
group_by(year) %>%
summarise(number_of_accidents = n())
accidents_by_year$year <- as.numeric(accidents_by_year$year)
accidents_by_year
## # A tibble: 6 × 2
## year number_of_accidents
## <dbl> <int>
## 1 2014 8321
## 2 2015 17020
## 3 2016 17109
## 4 2017 14459
## 5 2018 13682
## 6 2019 6921
# visualising the number of accidents by each year using geom_line
year_plot <- ggplot(accidents_by_year, aes(x = year, y = number_of_accidents)) +
# white background with grid lines
theme_bw() +
geom_line() +
labs(x = "Year",
y = "Number of Accidents",
title = "Distribution of Road Crashes By Year") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_year_plot <- ggplotly(year_plot)
interactive_year_plot
Was a particular day of the week more prone to accidents?
To generate relevant data, I will use aggregate functions to summarise ‘number_of_accidents’ based on the variable ‘day_of_week’. Here, We have a categorical variable in ‘day_of_week’ and numeric variable in ‘number_of_accidents’. So a barplot will be quite useful to display this relationship. We will find if any particular day/s of the week is/are more prone to the accidents by mapping day of week to x axis and number of accidents to y axis.
# using aggregate functions to summarise the number of accidents
# based on day of week and assigning it to the variable 'accidents_by_day'
accidents_by_day <- sel_crashes_data %>%
group_by(day_of_week) %>%
summarise(number_of_accidents = n())
accidents_by_day
## # A tibble: 7 × 2
## day_of_week number_of_accidents
## <chr> <int>
## 1 Friday 14182
## 2 Monday 10558
## 3 Saturday 8509
## 4 Sunday 10170
## 5 Thursday 11675
## 6 Tuesday 11033
## 7 Wednesday 11385
# visualising the number of accidents by day of week using geom_bar
day_plot <- ggplot(accidents_by_day, aes(x = day_of_week, y = number_of_accidents)) +
# white background with grid lines
theme_bw() +
geom_bar(stat = "identity", color = "white") +
labs(x = "Day of Week",
y = "Number of Accidents",
title = "Distribution of Road Crashes By Day of Week") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_day_plot <- ggplotly(day_plot)
interactive_day_plot
Was a particular time of the day(morning rush/evening rush) more prone to accidents?
In order to create a tibble of hour and number of accidents, first I need to extract hour from the variable ‘accident_time’. Then use aggregate functions to calculate the number of accidents recorded by each hour. I will mostly use bar plots for the analysis since we have many categorical variables in the dataset. We can find out how much of an impact rush hours has on the number of accidents by mapping hour to x axis and number of accidents to y axis.
# creating new 'hour' column
sel_crashes_data$hour <- format(strptime(sel_crashes_data$accident_time, "%H:%M:%S"), "%H")
# using aggregate functions to summarise the number of accidents
# by the hour and assigning it to the variable 'accidents_by_hour'
accidents_by_hour <- sel_crashes_data %>%
group_by(hour) %>%
summarise(number_of_accidents = n())
accidents_by_hour
## # A tibble: 24 × 2
## hour number_of_accidents
## <chr> <int>
## 1 00 981
## 2 01 853
## 3 02 669
## 4 03 585
## 5 04 581
## 6 05 1042
## 7 06 2138
## 8 07 3292
## 9 08 4841
## 10 09 4139
## # … with 14 more rows
# visualising the number of accidents by hour using geom_bar
hour_plot <- ggplot(accidents_by_hour, aes(x = hour, y = number_of_accidents)) +
# white background with grid lines
theme_bw() +
geom_bar(stat = "identity") +
labs(x = "Hour",
y = "Number of Accidents",
title = "Distribution of Road Crashes By Hour") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_hour_plot <- ggplotly(hour_plot)
interactive_hour_plot
How are the accidents distributed across regions? How does it trend over the years?
First, filter out the ‘NA’ values from ‘region_name’ to better represent the available data. Next step I map region name to x axis and assign ‘year’ to fill on a bar chart to see the relative proportion of yearly accidents by the region name. Position adjustment is set to ‘dodge’ to arrange elements side by side. This will allow us to see the yearly trends in number of accidents for each individual region.
# using filter to remove 'region_name' with 'NA' values
sel_crashes_data <- sel_crashes_data %>%
filter(region_name != "NA")
# visualising the number of accidents by year and region using geom_bar
region_plot <- ggplot(sel_crashes_data, aes(x = region_name, fill = year)) +
# white background with grid lines
theme_bw() +
geom_bar(position = "dodge") +
# adapting a colour palette particularly suited to display discreet values
scale_fill_brewer(palette = "Set2") +
labs(x = "Region",
y = "Number of Accidents",
title = "Distribution of Road Crashes By Year and Region",
fill = " Year") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5)) +
# rotating and spacing x label
theme(axis.text.x = element_text(angle = 30, vjust=0.5, hjust=1))
# adding interactive elements to the plot
interactive_region_plot <- ggplotly(region_plot)
interactive_region_plot
How are alcohol time and overall visibility contributed to the accidents?
I will again use filter function to remove insignificant values from ‘light_condition’ to get more coherent visualisation outcome. Then map light condition to x axis and assign ‘alcoholtime’ to fill on a bar chart to see the relative proportion of accidents by the light condition and alcohol time. Alcohol times are defined as the following: Monday - Thursday 00:00 - 06:00 hours & 18:00 - 23:59 hours, Friday 00:00 - 06:00 hours & 16:00 - 23:59 hours, Saturday 00:00 - 08:00 hours & 14:00 - 23:59 hours, Sunday 00:00 - 10:00 hours & 16:00 - 23:59 hours. I hope to find any correlation between light condition/alcohol time and number of accidents.
# using filter to remove 'light_condition' with 'Unk' and 'Dark Street lights unknown' values
sel_crashes_data <- sel_crashes_data %>%
filter(light_condition != "Unk.", light_condition != "Dark Street lights unknown")
# visualising the factors of alcohol time and light condition to number of accidents using geom_bar
light_plot <- ggplot(sel_crashes_data, aes(x = light_condition, fill = alcoholtime)) +
# white background with grid lines
theme_bw() +
geom_bar() +
# dividing plot into rectangular subplots
facet_wrap(~ alcoholtime) +
# adapting a colour palette particularly suited to display discreet values
scale_fill_brewer(palette = "Set2") +
labs(x = "Light Condition",
y = "Number of Accidents",
title = "Alcohol Time and Light Condition Factor to Vehicle Crashes",
fill = " Alcohol") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5)) +
# rotating and spacing x label
theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=1)) +
# set legend position to top
theme(legend.position = "top")
# adding interactive elements to the plot
interactive_light_plot <- ggplotly(light_plot)
interactive_light_plot
How has the various speed zones affected the severity of the accidents?
We can find the impact of different speed limits on the severity of the accidents by producing a bar plot with aesthetic mapping of severity to x axis and assigning ‘speed_zone’ to fill. This should give us a clear picture if higher speed limit has any correlation with the severity of the accidents.
# using filter to remove insignificant values
sel_crashes_data <- sel_crashes_data %>%
filter(speed_zone != "Camping grounds or off road", speed_zone != "Not known",
speed_zone != "Other speed limit", severity != "Non injury accident",
severity != "NA")
# visualising accident severity by different speed zones using geom_bar
speed_plot <- ggplot(sel_crashes_data, aes(x = severity, fill = speed_zone)) +
# white background with grid lines
theme_bw() +
geom_bar(position = "dodge") +
# adapting a colour palette particularly suited to display discreet values
scale_fill_brewer(palette = "Set3") +
labs(x = "Severity",
y = "Number of Accidents",
title = "Severity Distribution by Speed Limit",
fill = " Speed Zone") +
theme(legend.position = "bottom") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_speed_plot <- ggplotly(speed_plot)
interactive_speed_plot
How does the collision type affect severity of the accidents?
To help us finding the answer for this, I map ‘severity’ to x axis and assign ‘accident_type’ to aesthetics parameter ‘fill’ on a bar chart. Position adjustment is set to ‘dodge’ to arrange elements side by side. This should show us accident severity relative to collusion type.
# visualising accident severity by different speed zones using geom_bar
accident_type_plot <- ggplot(sel_crashes_data, aes(x = severity, fill = accident_type)) +
# white background with grid lines
theme_bw() +
geom_bar(position = "dodge") +
# adapting a colour palette particularly suited to display discreet values
scale_fill_brewer(palette = "Set3") +
labs(x = "Severity",
y = "Number of Accidents",
title = "Accident Severity by Collision Type",
fill = " Accident Type") +
# center aligning the plot title
theme(plot.title = element_text(hjust = 0.5))
# adding interactive elements to the plot
interactive_accident_type_plot <- ggplotly(accident_type_plot)
interactive_accident_type_plot
These are the outcome of my analysis of ‘Victoria Crashes Data’ through the exploration and visualisation processes above:
As illustrated in the line chart ‘Distribution of Road Crashes By Year’, the yearly total peaked in 2014-15, steadily declined in 2017-18 and then significantly dropped in 2019. I suspect this drop in numbers is due to the strong road safety campaign undertaken by Department of Transport, presence of more highway patrol on the road and use of various speed deterrents including red light/speed cameras might have been a factor in reducing this number.
It seems number of accidents are pretty even throughout the weekdays except for on Friday. This makes sense since people are often more tired towards the end of the week, rushing to get home or away for a weekend trip. This may lead to less concentration on the road and more chance of involving in an accident. Meanwhile, Saturday seems to be least likely day for an accident to happen.
It is evident that most of the accidents happen during the evening rush hours between 3pm and 6pm, not far behind is morning rush hour between 8am and 9am. Obviously, during the rush hours there is higher volumes of traffic on the road and drivers are more stressed to get to work in the morning and rushing home in the evening.
Western Region has the lowest number of road crashes from 2014 to 2019, followed by North Eastern Region. On the other hand, Metropolitan South East Region has consistently high number of crashes, closely followed by Metropolitan North West Region. Both the Metropolitan regions had a large decrease from 2018 to 2019 - shows that message is reaching across to the community and people are being more cautious on the road. On the contrary, there was a steady reduction in number of accidents from 2015 to 2018 after really peaked in 2015. Two Metropolitan Regions are the most populated in Victoria which is responsible for high number of accidents.
Vast majority of the accidents happen during the day and it is not greatly impacted by alcohol time. Because most hours of the alcohol time belong to night time, hard to explain if the accidents at night has any relation with alcohol time. We certainly need more relevant data and exploration on this front.
Major proportion of the accidents in 100 km/hr zone turn into fatal accidents. As the speed limit reduces, proportion of number of fatal accidents compared to total number of accidents decreases. Most accidents happen in 60 km/hr zone probably due to the fact that 60km/hr covers most of the urban traffic areas.
Perhaps it is not surprising to see collision with vehicle is the major contributor to overall accidents numbers, followed by collision with a fixed object. But in terms of number of fatal accidents, these factors are evenly distributed. Disheartening to find out large numbers of pedestrians involved in serious and other injury accidents probably due to excessive mobile usage, jaywalking and not paying close attention while crossing the road.
Data collection for ‘Road Crashes Victoria’ began in July 2014 and regularly being updated ever since. I did an analysis from 2014 to 2019 where data is available for the entire year. The following trends were observed in the Victoria Road Crashes data:
Plotly. (2021). Plotly R open source graphing library. https://plotly.com/r/
R Studio. (2021). RStudio cheatsheets. https://www.rstudio.com/resources/cheatsheets/
Scientific Research. (2021). Road Traffic Crash Data: An overview on sources, problems and collection methods. https://www.scirp.org/journal/paperinformation.aspx?paperid=75975
Tierney, N. (2021). Getting started with naniar. http://naniar.njtierney.com/articles/getting-started-w-naniar.html#introduction
Tierney, N. (2020). RMarkdown for scientists. https://rmd4sci.njtierney.com/
Victoria Department of Transport Open Data Hub. (2021). Road crashes for five years - Victoria. https://vicroadsopendata-vicroadsmaps.opendata.arcgis.com/datasets/vicroadsmaps::road-crashes-for-five-years-victoria/about
Wickham, H. (2nd edition). Elegent graphics for data analysis. Springer.
Wickham, H. & Grolemund, G. (2017). R for data science. O’Reilly Media , Inc..